Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: use EncodedError in SnapshotResponse #75248

Merged
merged 1 commit into from
Nov 3, 2022

Conversation

tbg
Copy link
Member

@tbg tbg commented Jan 20, 2022

We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
SnapshotResponse were considered snapshot errors and were considered
retriable. This was causing TestLearnerSnapshotFailsRollback to run
for 90s, as TestCluster's replication changes use a SucceedsSoon to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast. Now, since snapshot error marks propagate over the
wire, we can do the marking on the sender of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing errMalformedSnapshot, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the unused lint.

Fixes #74621.
Fixes #87337.

Release note: None

@tbg tbg requested a review from a team as a code owner January 20, 2022 21:42
@tbg tbg requested a review from lidorcarmel January 20, 2022 21:42
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

Copy link
Contributor

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/replica_command.go, line 2512 at r2 (raw file):

		return err
	}
	return nil

maybe just:

	return r.store.cfg.Transport.SendSnapshot(
		ctx,
		r.store.allocator.storePool,
		req,
		snap,
		newBatchFn,
		sent,
	)

instead of everything? or maybe you like the existing pattern :)

Suggestion:

	if err := r.store.cfg.Transport.SendSnapshot(
		ctx,
		r.store.allocator.storePool,
		req,
		snap,
		newBatchFn,
		sent,
	); err != nil {
		return err
	}
	return nil

@tbg tbg requested a review from lidorcarmel January 21, 2022 08:18
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR, Lidor!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)


pkg/kv/kvserver/replica_command.go, line 2512 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

maybe just:

	return r.store.cfg.Transport.SendSnapshot(
		ctx,
		r.store.allocator.storePool,
		req,
		snap,
		newBatchFn,
		sent,
	)

instead of everything? or maybe you like the existing pattern :)

Done.

Copy link
Contributor

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)

tbg added a commit to tbg/cockroach that referenced this pull request Mar 2, 2022
These take >1m under race and are now skipped for the race build.

Note that TestLearnerSnapshotFailsRollback takes 90s even under ideal
conditions, though that is fixed once I get cockroachdb#75248 over the finish
line. At that point it probably wouldn't figure prominently as a
slow test, but until then it's verrrry slow (>7m, probably since
race stuff slows down the longer it runs) so still good to skip
it now.

https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1646221664032729

Release justification: testing-only change
Release note: None
craig bot pushed a commit that referenced this pull request Mar 3, 2022
77272: kvserver: skip some slow tests under race r=erikgrinaker a=tbg

These take >1m under race and are now skipped for the race build.

Note that TestLearnerSnapshotFailsRollback takes 90s even under ideal
conditions, though that is fixed once I get #75248 over the finish
line. At that point it probably wouldn't figure prominently as a
slow test, but until then it's verrrry slow (>7m, probably since
race stuff slows down the longer it runs) so still good to skip
it now.

https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1646221664032729

Release justification: testing-only change
Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg force-pushed the snaps-enc-err branch 2 times, most recently from e01eb26 to 8380321 Compare March 17, 2022 19:16
Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd noticed this recently while working on #77246 but couldn't see an easy way to work around it. TIL about errors.EncodedError.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg should we merge this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)

@tbg
Copy link
Member Author

tbg commented Mar 28, 2022

I had put it off a bit to avoid changing master code in case we'd need some backport for the allocator. If you think this is fine to merge now I would go ahead and do so. I would certainly like to get it in to save 90s per test run :-)

@aayushshah15
Copy link
Contributor

It :lgtm: from that angle, so I'd be in favour of merging to master. This PR should speed up all the tests that induce failures to replication changes.

@tbg
Copy link
Member Author

tbg commented Mar 30, 2022

bors r=aayushshah15

@craig
Copy link
Contributor

craig bot commented Mar 30, 2022

Build failed (retrying...):

@tbg
Copy link
Member Author

tbg commented Mar 30, 2022

bors r-

Looks like I need to shake out some flakes.

@craig
Copy link
Contributor

craig bot commented Mar 30, 2022

Canceled.

@tbg tbg force-pushed the snaps-enc-err branch 3 times, most recently from 8bf769d to 61d8135 Compare September 28, 2022 20:28
@tbg
Copy link
Member Author

tbg commented Sep 28, 2022

Rebased. Tests seem to pass reliably (at least kvserver package), but this doesn't seem good:

--- PASS: TestReplicateQueueSwapVotersWithNonVoters (80.01s)

Going to take another look at slow tests before merging this. I bet lots of those aren't new:

$ less slow-tests.txt | grep -E '^... PASS: .*\([1-9][0-9]\.[0-9]+s\)$'
--- PASS: TestOptimisticEvalWithConcurrentWriters (10.54s)
--- PASS: TestLargeUnsplittableRangeReplicate (12.66s)
--- PASS: TestProtectedTimestamps (13.14s)
--- PASS: TestGetChecksumNotSuccessfulExitConditions (11.11s)
--- PASS: TestReplicateQueueShouldQueueNonVoter (15.74s)
--- PASS: TestReplicateQueueMetrics (23.49s)
--- PASS: TestSplitBurstWithSlowFollower (13.46s)
--- PASS: TestResetQuorum (42.88s)
--- PASS: TestReplicateQueueUpAndDownReplicateNonVoters (14.29s)
--- PASS: TestClosedTimestampCantServeForNonTransactionalBatch (19.95s)
--- PASS: TestClosedTimestampFrozenAfterSubsumption (19.99s)
--- PASS: TestReplicateRestartAfterTruncationWithRemoveAndReAdd (11.29s)
--- PASS: TestReplicateQueueDecommissioningNonVoters (28.87s)
--- PASS: TestNonVoterCatchesUpViaRaftSnapshotQueue (10.56s)
--- PASS: TestClosedTimestampCantServeWithConflictingIntent (10.27s)
--- PASS: TestClosedTimestampCanServeAfterSplitAndMerges (10.26s)
--- PASS: TestLearnerRaftConfState (12.83s)
--- PASS: TestSystemZoneConfigs (22.74s)
--- PASS: TestClosedTimestampCanServeThroughoutLeaseTransfer (10.95s)
--- PASS: TestReplicateQueueSwapVotersWithNonVoters (86.40s)
--- PASS: TestDecommission (14.99s)
--- PASS: TestBackpressureNotAppliedWhenReducingRangeSize (31.75s)
--- PASS: TestStoreRangeMergeTimestampCache (34.41s)
--- PASS: TestMergeQueueSeesNonVoters (21.09s)
--- PASS: TestProcessSplitAfterRightHandSideHasBeenRemoved (10.07s)
--- PASS: TestConsistencyQueueRecomputeStats (35.57s)
--- PASS: TestClosedTimestampCanServeForWritingTransaction (10.07s)
--- PASS: TestSnapshotsToDrainingNodes (59.17s)
--- PASS: TestClosedTimestampCantServeBasedOnUncertaintyLimit (11.15s)
--- PASS: TestLearnerSnapshotFailsRollback (92.51s)

@tbg
Copy link
Member Author

tbg commented Sep 28, 2022

Ironically TestLearnerSnapshotFailsRollback still takes 45s for each of the two cases 😆 well I'll go through these tests.

@tbg tbg force-pushed the snaps-enc-err branch 2 times, most recently from 09da176 to 8310d5b Compare October 31, 2022 14:55
@tbg tbg requested a review from lidorcarmel October 31, 2022 14:55
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @lidorcarmel from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lidorcarmel)

@tbg
Copy link
Member Author

tbg commented Oct 31, 2022

PTAL @aayushshah15

=== RUN   TestLearnerSnapshotFailsRollback/learner
    --- PASS: TestLearnerSnapshotFailsRollback/learner (0.40s)
=== RUN   TestLearnerSnapshotFailsRollback/non-voter
    --- PASS: TestLearnerSnapshotFailsRollback/non-voter (0.39s)

@tbg tbg requested review from aayushshah15 and removed request for lidorcarmel October 31, 2022 16:20
Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lidorcarmel)

We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
@tbg
Copy link
Member Author

tbg commented Nov 2, 2022

bors r=aayushshah15
TFTR!

@craig
Copy link
Contributor

craig bot commented Nov 3, 2022

This PR was included in a batch that timed out, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Nov 3, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Nov 3, 2022

Build succeeded:

@craig craig bot merged commit d2171af into cockroachdb:master Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants